Note: for all answers to questions, place all of your code in “chunks” (provided below) and submit a completed html document to web campus by “knitting” your Rmarkdown document. Insert your name and date in the information at the top!!
setwd("/Users/griffinshelor/Documents/Schoolwork/UNR/GEOG616/Labs/Lab2")
mydata <- read.csv("mydata.csv")
plot(mydata$sqft, mydata$cost)
Using the dataframe you loaded in Question 1 - Use the hist() function to plot a histogram of costs with a title of “Cost Distribution”, and an x-axis label of “Cost ($)”
hist(mydata$cost, main = "Cost Distribution", xlab = "Cost ($)")
Using the dataframe you loaded in Question 1 - Create a linear model looking at Cost as a function of Square Footage. Summarize your model and describe what you found.
mydata_lm <- lm(cost ~ sqft, data = mydata)
plot(mydata_lm)
summary(mydata_lm)
##
## Call:
## lm(formula = cost ~ sqft, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92.499 -19.302 2.406 28.019 80.607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.7504 60.3477 1.172 0.2621
## sqft 0.1878 0.0664 2.828 0.0142 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.18 on 13 degrees of freedom
## Multiple R-squared: 0.3809, Adjusted R-squared: 0.3333
## F-statistic: 7.997 on 1 and 13 DF, p-value: 0.01425
## The minimum and maximum residuals are not similarly far apart from the median. This indicates a possible skew by values with lower square footage that have lower costs than expected. The Q-Q residual plot shows most points near the straight line, with more outliers being near the lower end of square footage than the higher end. The t value and P value for the intercept indicate that it is likely that the intercept is near 70.75. The t and P values for the sqft coefficient indicate that it is unlikely that our coefficient for sqft is 0, indicating there is some statistically significant relationship between sqft and cost.
From the graph in Question 1C it looks there is a possible increase in cost in buildings that are larger than 1000 square feet.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
mydata <- mydata |>
mutate(building_size = case_when(sqft < 1000 ~ as.factor("low"),
TRUE ~ as.factor("high")))
size_lm <- lm(cost ~ building_size, data = mydata)
plot(size_lm)
summary(size_lm)
##
## Call:
## lm(formula = cost ~ building_size, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.773 -24.973 1.527 24.777 70.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.47 13.07 16.329 4.82e-10 ***
## building_sizehigh 91.03 25.32 3.596 0.00326 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.36 on 13 degrees of freedom
## Multiple R-squared: 0.4986, Adjusted R-squared: 0.4601
## F-statistic: 12.93 on 1 and 13 DF, p-value: 0.003259
## The t and p values for both coefficients listed indicate there is likely a statistically significant relationship between building size and cost. The minimum residual being further from the median than the maximum residual indicates that there is a possible tail in the residuals. This could indicate skew. The Q-Q residual plot shows most points near the straight line, also indicating that there is a relationship between relative building size and cost. However, there is again an obvious outlier at the lower left corner of the Q-Q plot.This could affect the adjusted R^2 and explain why it is 0.46 as opposed to a higher value which would represent a stronger correlation.
sqft_size_lm <- lm(cost ~ sqft + building_size, data = mydata)
plot(sqft_size_lm)
summary(sqft_size_lm)
##
## Call:
## lm(formula = cost ~ sqft + building_size, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.074 -21.129 -2.127 26.385 68.962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 186.70682 88.04107 2.121 0.0555 .
## sqft 0.03361 0.10925 0.308 0.7636
## building_sizehigh 79.29675 46.28620 1.713 0.1124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.95 on 12 degrees of freedom
## Multiple R-squared: 0.5025, Adjusted R-squared: 0.4196
## F-statistic: 6.061 on 2 and 12 DF, p-value: 0.01515
## None of the P values or t values indicate a statistically significant relationship. The intercept comes closest, which makes sense based on the other models showing a relationship where cost goes up as building size goes up, whether building size is represented by specific square footage values or building size as factors. This is likely because sqft and building_size indicate the same attribute in different ways. This means that adding one of those variables to the model when the other is already present means we are not adding new information to the model. The Q-Q residuals plot is also very similar to the same plot from the previous models, indicating that cost does increase with building size, but not in a way that is different from either size variable being modeled on its own.
#Text-only answer
## The model using building_size as a predictor of cost from 4b is the best model because while the Q-Q residual plot looks very similar to the model from question 3, the p-values are lower for the intercept and for building size, which indicates a much stronger relationship between building size and cost, as well as a much stronger indication that the true intercept is approximately 304.5.
The Orings dataset from the Stat2Data package gives data on the damage that occurred in US Space Shuttle launches prior to the Challenger launch 33 years ago on January 28, 1986. The pre-launch charts for this mission that determined the go-no/go for launch contained only the data in rows 1,2,4,11,13,18.
library(Stat2Data)
data("Orings")
Orings_subset <- Orings[c(1,2,4,11,13,18),]
plot(Orings_subset$Temp, Orings_subset$Failures)
plot(Orings$Temp, Orings$Failures)
## In the full dataset, there was a much higher rate of failure for launches at below 65 degrees, but this may not be a meaningful conclusion since in the full data set a much smaller proportion of the launches happened at below 65 degrees. It is possible that the launches at below 65 degrees were outliers that would tend towards a more normal failure rate as more launches happened. The subset only included launches at above 65 degrees, so it does not provide any comparison at all, let alone a meaningful one.
# install.packages("datasets")
library(datasets)
data("attitude")
plot(attitude$complaints, attitude$rating)
rating_complaints_lm <- lm(rating ~ complaints, data = attitude)
plot(rating_complaints_lm)
summary(rating_complaints_lm)
##
## Call:
## lm(formula = rating ~ complaints, data = attitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8799 -5.9905 0.1783 6.2978 9.6294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.37632 6.61999 2.172 0.0385 *
## complaints 0.75461 0.09753 7.737 1.99e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.993 on 28 degrees of freedom
## Multiple R-squared: 0.6813, Adjusted R-squared: 0.6699
## F-statistic: 59.86 on 1 and 28 DF, p-value: 1.988e-08
## The p values for the coefficients indicate that there is a strong relationship between complaints and rating. The scale-location plot does not indicate heteroscedasticity. The Q-Q residual plot does indicate light tails on both ends of the dataset, meaning that the residuals get unusually larger as complaints get towards the extreme lows or highs of the dataset. However, most of the points in that plot are close to the line, so they are likely not much of an issue and don't necessarily violate the assumption of a normal distribution.
For the dataset in question 6, Are the number of raises employees get associated with the overall rating (use a linear model)?
rating_raises_lm <- lm(rating ~ raises, data = attitude)
plot(rating_raises_lm)
summary(rating_raises_lm)
##
## Call:
## lm(formula = rating ~ raises, data = attitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.123 -7.109 1.377 5.741 19.568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.9778 11.6881 1.709 0.098468 .
## raises 0.6909 0.1786 3.868 0.000598 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10 on 28 degrees of freedom
## Multiple R-squared: 0.3483, Adjusted R-squared: 0.325
## F-statistic: 14.96 on 1 and 28 DF, p-value: 0.0005978
## The P value for the raises coefficient indicates a statistically significant relationship between raises and ratings. The intercept's P-value does not indicate a statistically significant relationship, but the Residuals vs Fitted plot supports a linear relationship.
For the models generated in questions 6 and 7, we will evaluate model fits and assumptions.
plot(rating_complaints_lm)
plot(fitted(rating_complaints_lm) ~ attitude$rating)
cor(fitted(rating_complaints_lm), attitude$rating)
## [1] 0.8254176
## The residuals vs fitted plot from the LM indicates that the data does not display evidence of a non-linear relationship. However, the Q-Q residual plot has tails at both ends of the plot which indicate that the data does deviate from a normal distribution.
plot(rating_raises_lm)
plot(fitted(rating_raises_lm) ~ attitude$rating)
cor(fitted(rating_raises_lm), attitude$rating)
## [1] 0.590139
## Similar to the Question 6 model, the Residuals vs Fitted plot does not display evidence of a non-linear relationship. While the Q-Q Residuals plot does display a few outliers, they are closer to the line and the plot does not display as many tails like the Question 6 model does. This is evidence that the residuals more closely resemble a normal distribution.
## Due to the difference between the Q-Q resdiduals plots, the Question 7 model predicting rating from raises is better since it does not include tails or deviate from the straight dashed line.
The following data represent the total number of aberrant crypt foci (abnormal growths in the colon) observed in nine rats that were given a single does of the carcinogen azoxymethane and examined after six weeks.
87 53 72 90 78 85 83 99 49
crypt_foci <- c(87,53,72,90,78,85,83,99,49)
crypt_foci_mean <- mean(crypt_foci)
crypt_foci_mean
## [1] 77.33333
crypt_foci_sd <- sd(crypt_foci)
crypt_foci_sd
## [1] 16.72573
set.seed(456)
new_means_seq <- seq(crypt_foci_mean + 5, crypt_foci_mean + 50, by = 1)
crypt_foci_samples_df <- data.frame(new_means = new_means_seq, p_values = 0)
## T-tests for every new mean in the sequence
for(i in 1:length(new_means_seq)){
rnorm_dist <- rnorm(length(crypt_foci), new_means_seq[i], sd = crypt_foci_sd)
crypt_foci_ttest <- t.test(crypt_foci, rnorm_dist)
crypt_foci_samples_df$p_values[i] <- crypt_foci_ttest$p.value
}
crypt_foci_samples_df$mean_diff <- crypt_foci_samples_df$new_means - mean(crypt_foci)
plot(crypt_foci_samples_df$mean_diff, crypt_foci_samples_df$p_values)
abline(h = 0.05, col = 'red')
## The mean differences become statistically significant at increment 14, but do not become consistently significant until increment 25.
Read the two papers in the Lab 1 folder by Maass et al. (2007) and Beck et al. (2014). Recognizing that these are very different topics and research, discuss the similarities in terms of the influence of bias on scientific results using examples from the papers. Limit your answer to approximately 300 words.
#Text-only, ~300 words
## The usage of expert evaluation in the Beck et al. (2014) paper but not in the Maass et al. (2007) paper resulted in some similarities in terms of its impact on the results and conclusions of the respective papers. In Maass et al., expert opinions were not used to evaluate the beauty or quality of the soccer goals shown to study participants, but one of the conclusions by Maass et al. was that showing soccer goals to non-experts likely resulted in lower spatial effects in the results and that showing the same goals to soccer experts would likely result in more variance in how “beautiful” a goal was based on soccer experts being more attuned to what distinguishes one soccer goal from another. The Beck et al. (2014) paper included expert analysis of their SDM models and found that Maxent models with fewer background points were graded better by the expert than models with many background points. However, this inclusion of expert analysis could potentially be subject to bias on its own. The authors found that when using fewer background points, their AUC_Maxent reflected a decrease in model quality, an opposite relationship to the trend in expert grading. The goal of using expert assessment to measure model quality in the Beck et al. paper was to discover a way to minimize spatial bias of the GBIF and other similar distributional database. However, expert assessment and model accuracy measurements such as AUC disagreed over whether this occurred. The authors concluded that this was a result of AUC being affected by data bias and thus is not reliable as an indicator of model quality. One question that does not get answered by the Beck et al. paper is whether the expert analysis used to evaluate model quality could potentially be subject to similar spatial bias as detected in the Maass et al. paper on the subjective beauty of soccer goals and violence of movie scenes.